cuDNN support cleanup #15812
Conversation
Great to see this cleanup! While I understand the intention of the assert, I feel it's a bit obsolete and makes the code a bit cluttered, if you know what I mean. Couldn't we have the assertion in a central place (i.e. on init) instead of re-asserting the same thing over and over? The minimum cuDNN version is 7, so we treat it as the default. That might make the code a bit cleaner. But I don't have strong feelings about it, tbh.
I did consider the 'single assert' approach you mentioned. I ultimately advocated sprinkling a handful of asserts within the files having the actual dependencies. See #15449 (comment). We now have 17 assert lines. I don't think every cuDNN operator impl needs to state its dependency, since some are fine with very old cuDNN versions. How about as we advance the cuDNN support level, we clean out the old asserts that mention earlier versions that aren't out 'in the wild' anymore? Since you say you don't feel strongly, I prefer to leave the asserts as is; we can always move toward fewer asserts going forward. I notice now the stats on this PR: a net of about 750 lines cleaned out!
@@ -44,6 +44,10 @@ namespace op {
*/
template <typename CuDNNAlgoType>
class CuDNNAlgo {
// Compile guard needed to get past non-nvcc compilation of cudnn_algoreg.cc.
Why?
I've pushed a fix that removes the compile guard. Before the fix (and without the guard), the following compile step failed:
g++ ... -DMSHADOW_USE_CUDNN=1 ... -c cudnn_algoreg.cc
The problem is that the cudnn_algoreg-inl.h and cudnn_algoreg.cc files look at MXNET_USE_CUDNN, not MSHADOW_USE_CUDNN. The header file ./include/mxnet/libinfo.h has the vital missing line that equates the two:
#define MXNET_USE_CUDNN MSHADOW_USE_CUDNN
So this was a case of not adhering to the rule 'include what you use.' The header file ./src/common/cuda_utils.h was referring to MXNET_USE_CUDNN, but not including any header file that would ensure libinfo.h was included. Now fixed.
Now that MSHADOW is part of MXNET and no longer a submodule, is there a benefit to consolidating the dual naming as in {MSHADOW,MXNET}_USE_CUDNN?
@@ -44,6 +44,9 @@ enum CuDNNBatchNormOpAuxiliary {kMovingMean, kMovingInvVar};
#if defined(__CUDACC__)
template<typename DType>
class CuDNNBatchNormOp {
// This file was formerly included in batch_norm.cu only when CUDNN_MAJOR >= 5
This comment may be useful for review of this PR, but I don't think it makes sense to have it going forward.
I can take this out, along with the explanations of which cuDNN features are being used that require the assert, as you mention below.
I'm not sure how I feel about the comments like "This file uses X and Y feature of cudnn 7" - I don't think it is actually useful for maintenance.
Sounds good to me, thanks for the explanation! Yeah it's certainly a great cleanup, no doubt about that :)
#if CUDNN_VERSION >= 7200
if (GetEnvAllowTensorCore() && GetEnvAllowTensorCoreConversion() &&
    (DataType<DType>::kFlag != kFloat16))
  math_type = CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION;
Non-blocking for this PR, but this change reminded me that I believe there's a bug here. IIRC the problem was that math_type can be reset by auto-tuning. Here's an issue I opened a while ago that I haven't had time to follow up on:
"Auto-tuning is overwriting the math mode at convolution tuning time. Probably the right thing to do when implementing TCs but it's preventing the conversion math type from being used. We'll have to think about the long-term fix for this, but I've currently commented out the math type reset locally and I'm trying to verify this cudnn feature provides a significant speedup before moving forward. "
LGTM. Nit that we could remove the comment already highlighted, but the PR LGTM.
* Remove old cudnn support (<v7). * Simplify conv impl selection. * Remove comments justifying STATIC_ASSERT_CUDNN_VERSION_GE.
Description
Background: Now that MXNet 1.5 no longer compiles against cuDNN v6 and earlier, any cuDNN code that special-cases those earlier versions is needlessly cluttering the codebase. PR #15449 introduced the STATIC_ASSERT_CUDNN_VERSION_GE() macro as a way to document and enforce the requirements of the codebase locally in the files that create those dependencies.
This PR removes code that needlessly adapts operators to cuDNN v6 (and earlier) APIs and uses STATIC_ASSERT_CUDNN_VERSION_GE() to enforce the restriction to operate only against cuDNN v7 and later. This cleanup helps create a concise and understandable codebase able to accept further feature improvements.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments